Tunnel troubleshooting

Overview

API steps, browser steps running remotely, and pollers can run via a Legion tunnel installed in the customer network. If one of these flows fails, use this guide to debug the tunnel flow and identify if/where the issue is.

To find the problem we'll check each hop along the way to see that it's working correctly:

Call sender telemetry
Proxy mapping configuration
Dedicated proxy VM (on our side - AWS EC2 or GCP Compute Engine)
Tunnel container (on customer side)

References to understand how the E2E flow works: Tunnel architecture, Setup customer tunnel - proxy mapping

1. Call sender telemetry

Check the Axiom telemetry logs to validate that a tunnel proxy was used for the failed call and with what proxy URL (e.g AWS: http://ip-172-11-12-13.us-east-1.compute.internal:1380, GCP: http://10.10.0.25:1380)

If a tunnel proxy was not used when one is expected, the issue is in the proxy mapping and not the tunnel infra itself. Jump to the next step and finish the investigation there. If logs show no tunnel proxy was used and tunnel isn't expected to be used then the failure likely isn't related to the tunnel.

2. Proxy mapping configuration

Open the proxies mgmt page for the customer's region (e.g https://usw2.legionsecurity.ai/mgmt/proxy) and look at the mapping configuration for the customer's org id.
Find the applicable mapping configuration for the failed call - by either tool or call URL domain. Note that domain mapping isn't applied for pollers.
Verify the mapping is configured correctly:
- Tool name or URL matches the failing call
- Mapping type is 'tunnel'
- Proxy URL matches the one logged in the telemetry in step 1 exactly

If the mapping is missing/wrong, fix mapping first and retest before moving to Step 3.

3. Dedicated proxy VM

First we validate the proxy instance URL is correct:

Locate the dedicated proxy instance/VM for the org in the relevant cloud and region/project. The org id is the prefix of the instance name.
- AWS: EC2 -> Instances (customer region), then select the proxy EC2 instance
- GCP: Compute Engine -> VM instances (project for the customer region), then select the proxy VM
Get the instance internal host value:
- AWS: Private IP DNS name (IPv4 only) from the instance summary
- GCP: internal DNS name or internal IP from VM details
Verify the internal host value matches the proxy mapping exactly (expected format: http://<proxy-instance-internal-host>:1380).

Then we check the proxy is running and has no errors:

Connect to the proxy VM and check the MITM proxy status:
- AWS: EC2 -> Connect -> Session Manager
- GCP: Compute Engine -> VM instances -> SSH -> Open in browser window
```
sudo systemctl status mitmproxy
```
Expected result: active (running) in green.
Inspect the MITM proxy service logs logs:
```
sudo journalctl -u mitmproxy.service
```
What to confirm:
- MITM proxy is showing the failed request. If it doesn't, the issue is in the mapping or security between caller and proxy VM and there's no need to continue to the next investigation steps.
- Check if the failed requests shows any verbose error in mitm, like a python exception. If it does - fix the code issue in the tunnel repo and follow the tunnel installation guide on how to update and restart the proxy code in the instance.the requests are being repacked and forwarded to the customer Cloudflare tunnel DNS
- If the failed request only shows a failure HTTP status code (most likely 502), proceed to the next step
If needed, restart the service after fixing the code bug:
```
sudo systemctl restart mitmproxy
sudo systemctl status mitmproxy
```

4. Tunnel container

Go back to the proxies mgmt page for the customer's region (e.g https://usw2.legionsecurity.ai/mgmt/proxy) and locate the customer's proxy mapping again by its org id
Run a health ping on the tunnel to verify the tunnel container on the customer side is running correctly
Pull tunnel-related telemetry/logs for the failed with filter on 'warning' severity and above
Look for errors indicating where the flow breaks:
- Upstream target connectivity/TLS issues
- Code exception while handling the proxied call

If a failure log was found in the tunnel telemetry - fix the code bug (if it's an issue on our end) and deploy an updated tunnel image that the customer will need to update on their end, or look into the upstream issue and how we can handle it.

If no failure log is found, try changing the severity filter to look at all 'debug' severity and above and look for DLP data mask replacements that happened during the relevant time window. It's possible a faulty data mask regex replaced values it shouldn't have and caused failures (e.g break a browser page's JS files structure). If that is the case, use the data mask read/write mechanism (VERY CAREFULLY!) in the mgmt page to fix the faulty regex.

Tunnel troubleshooting

Overview​

1. Call sender telemetry​

2. Proxy mapping configuration​

3. Dedicated proxy VM​

4. Tunnel container​

Overview

1. Call sender telemetry

2. Proxy mapping configuration

3. Dedicated proxy VM

4. Tunnel container